Quarterly and Annual Ridership Totals by Mode of Transportation 1
The purpose of this data is to gain a baseline perspective of the current state of public transit usage in the United States. Therefore, this data set should be cleaned so that trends can be visualized, without superfluous information that does not relate to any current phenomena. The steps used in cleaning this data are below.
Trim the data set:
Columns 1 to 11, to remove blank items in the .csv file as well as notes added by the source
Rows 81 to 133, to remove records from before 2010, as those are superfluous when comparing against current trends
Create one column to account for year and quarter to improve readability
Convert all numeric columns to a numeric data type
Remove extra year and quarter columns as they are now unnecessary
Regarding the numeric fields, I have chosen to keep them all for now as each one can provide insight into which modes of transportation are most affected by certain factors. Below is the code to apply the steps laid out, as well as a comparison between the raw and cleaned data sets.
Code
import pandas as pd
from datetime import datetime
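A minimal sketch of the trimming steps described above, using a small stand-in frame (the real file layout, column names, and the 81–133 row range differ from this toy example):

```python
import pandas as pd

# Stand-in for the raw APTA export; numbers arrive as text, and the
# file carries note/blank columns and pre-2010 rows. Column names here
# are assumptions for illustration.
raw = pd.DataFrame({
    "Note": ["", ""],            # stand-in for the blank/note columns
    "Year": [2010, 2010],
    "Quarter": [1, 2],
    "Bus": ["1200", "1100"],
    "Heavy Rail": ["900", "950"],
})

# Step 1: drop note/blank columns (the real file drops columns 1-11)
df = raw.drop(columns=["Note"])

# Step 2 would drop pre-2010 rows, e.g. df = df.drop(index=range(80, 133))

# Steps 3-5: one combined Year-Quarter column, drop the originals,
# and coerce the remaining mode columns to numeric.
df["Period"] = df["Year"].astype(str) + " Q" + df["Quarter"].astype(str)
df = df.drop(columns=["Year", "Quarter"])
mode_cols = df.columns.drop("Period")
df[mode_cols] = df[mode_cols].apply(pd.to_numeric, errors="coerce")
```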
The main objective of cleaning this data is to narrow down the fields to remove superfluous columns, and to decode the numerical values that the dataset has in place of categorical values. To do this, we will reference the glossary that accompanies the dataset. The steps are the following:
Remove columns that provide excess detail
Rename columns
Remove columns that will not be necessary for any analysis techniques to be used later on
Replace codes for sex, marital_status, race, hispanic, employment, metropolitan_status, and transportation_type
Codes for metropolitan_status and transportation_type are aggregated to simplify data (e.g., all public transit types are labeled Public Transit)
Set age and income to numerical data types
Set all values where income is 0 and the person is not in the labor force to NA
Drop all rows where transportation_type is NA, as those are not labeled
Set all placeholder values for city_population to NA
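The decoding and NA-handling steps above can be sketched as follows; the code values and mappings here are made up for illustration, as the real ones come from the IPUMS glossary:

```python
import pandas as pd
import numpy as np

# Toy frame with hypothetical glossary codes (not the real IPUMS codes)
df = pd.DataFrame({
    "sex": [1, 2, 1],
    "transportation_type": [31, 32, 10],   # assumed: 31/32 = transit modes
    "income": ["45000", "0", "52000"],
    "employment": [1, 3, 1],               # assumed: 3 = not in labor force
})

# Replace codes with labels per the glossary
df["sex"] = df["sex"].map({1: "Male", 2: "Female"})

# Aggregate related codes into one label (all transit modes become
# "Public Transit", simplifying later analysis)
df["transportation_type"] = df["transportation_type"].map(
    {31: "Public Transit", 32: "Public Transit", 10: "Private Vehicle"}
)

# Numeric types, then NA for income of 0 outside the labor force
df["income"] = pd.to_numeric(df["income"]).astype(float)
df.loc[(df["income"] == 0) & (df["employment"] == 3), "income"] = np.nan

# Drop rows whose transportation_type had no glossary label
df = df[df["transportation_type"].notna()]
```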
This data was gathered to attempt to find differences in public transit system performance by city. However, the raw data comes in a slightly different form: the observational unit is mode of transportation, separated by transit agency. For our purposes, we want the observational unit to be each city, so many of these rows must be consolidated. To accomplish this, it is important to understand which values should be summed (i.e., counting variables) and which should be combined as proportions of the total. For each column, a formula must be applied to ensure proper consolidation. The steps for cleaning this dataset are as follows:
Remove rows with unnecessary information
Remove rows in which the Most Recent Report Year is not 2022, the latest year with sufficient data
Initialize new dataframe to insert consolidated rows
Length equal to the number of unique city names (i.e., number of cities)
Set Population to the population value associated with each city
Set Area to the area value in square miles associated with each city
Set Cost_per_trip, Fare_per_trip, and Miles_per_trip
As these are all average values, this is done by multiplying the value for each transportation type by its number of passenger trips to properly weight that data point, summing those products, and dividing by the total number of passenger trips for that city
Compute Trips_per_capita as total trips divided by population
Write the resulting dataframe to a .csv file
The code and output from cleaning this dataset are below:
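The trip-weighted consolidation described above can be sketched with illustrative numbers; the column names mirror the description but are assumptions, not the actual NTD schema:

```python
import pandas as pd

# Agency/mode rows to be consolidated into one row per city
rows = pd.DataFrame({
    "City": ["Denver", "Denver", "Portland"],
    "Mode": ["Bus", "Light Rail", "Bus"],
    "Trips": [1_000_000, 250_000, 600_000],
    "Cost_per_trip": [4.00, 6.00, 5.00],
})
populations = {"Denver": 715_000, "Portland": 650_000}  # illustrative

# Weight each mode's average cost by its trip count, sum per city,
# then divide by the city's total trips for a proper city average.
rows["cost_x_trips"] = rows["Cost_per_trip"] * rows["Trips"]
cities = rows.groupby("City").agg(
    Trips=("Trips", "sum"),
    cost_sum=("cost_x_trips", "sum"),
)
cities["Cost_per_trip"] = cities["cost_sum"] / cities["Trips"]
cities["Trips_per_capita"] = cities["Trips"] / cities.index.map(populations)
cities = cities.drop(columns="cost_sum")
```

A naive mean of `Cost_per_trip` for Denver would give 5.00, while the trip-weighted version correctly gives 4.40, since the cheaper bus mode carries four times as many trips.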
The purpose of cleaning this data is to perform Naive Bayes classification in the future, as we have labeled text data that can be valuable for analyzing how people express their opinions on public transit systems. The raw data contains duplicates on each page that must be removed, and several fields need their data types corrected. Since this process must be repeated seven times, once for each transit organization, we will create a generalized function to be called for each city involved. The steps in this function are:
Remove excess columns
Remove rows where review is duplicated (date is NA in these records, so we drop based on that)
Add a column to keep track of which agency the review is about
Change column names
Take just the numerical rating and set to integer type
Set Date field to date type
Append to main dataframe using pd.concat() function
The code for this function is below:
Code
def clean_yelp(x, y):
    df = pd.read_csv('../data/yelp_reviews/' + x + '_reviews.csv')
    df = df.drop(columns='Unnamed: 0')
    df = df[df['1'].notna()]  # duplicate rows have NA dates, so drop on that
    df['Agency'] = x
    df = df.rename(columns={'0': 'Rating', '1': 'Date', '2': 'Review'})
    df['Rating'] = df['Rating'].str[0].astype(int)
    # Convert the whole Date column at once; a per-row loop that
    # reassigns its loop variable would never modify the frame.
    df['Date'] = pd.to_datetime(df['Date'], format="%b %d, %Y")
    total = pd.concat([y, df])
    return total
Next, we will call the function for each of our seven cities. The code and output for this are below:
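The driver loop can be sketched as below; the agency name stems are hypothetical, and `clean_yelp` here is a stub standing in for the real function above, which reads each agency's review file from disk:

```python
import pandas as pd

# Hypothetical file-name stems; the real list has seven agencies
agencies = ["mta", "la_metro", "wmata"]

def clean_yelp(agency, total):
    # Stub: the real function reads and cleans
    # '../data/yelp_reviews/<agency>_reviews.csv'
    df = pd.DataFrame({"Agency": [agency], "Rating": [5]})
    return pd.concat([total, df], ignore_index=True)

# Accumulate every agency's cleaned reviews into one frame
reviews = pd.DataFrame(columns=["Agency", "Rating"])
for agency in agencies:
    reviews = clean_yelp(agency, reviews)
```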
Remote Work Trends - Desires of Employers vs. Workers 11
The insight to be gathered from this data is the discrepancy between the remote work schedules employers want from their workers and the schedules the workers themselves want. Therefore, while these come from two separate .csv files, it will be necessary to merge the data sets into one data frame. Additionally, each data set has two variables:
1. The amount of working from home (days per week) employers or workers want for all workers
2. The amount of working from home (days per week) employers or workers want for workers able to work from home
Since both of these have ample data, we will keep both. The methodology is as follows:
Read both data sets and trim excess space where the owner of the data had included a citation note
Merge by date
These data sets come from the same series of surveys, so the date column is exactly the same, eliminating any need for removal of rows.
Convert the date field to a date data type and order by date
Rename columns based on glossary provided by the data source
Ensure numeric columns have numeric data type
Remove rows in which there are too many NA values.
Rows in which the values for all workers OR workers able to work from home have NA values can be kept, as there is a comparison to be made with the ones that don’t have NA values. Only rows in which no comparison can be made will be removed.
The code for this is below, along with a screenshot of the cleaned data.
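The merge-and-filter logic above can be sketched as follows; the column names are stand-ins, as the real names come from the survey's glossary:

```python
import pandas as pd

# Toy versions of the two survey files; both share the same date series
employers = pd.DataFrame({
    "date": ["2021-01-01", "2021-02-01"],
    "employer_wfh_days": [1.5, None],
})
workers = pd.DataFrame({
    "date": ["2021-01-01", "2021-02-01"],
    "worker_wfh_days": [2.5, None],
})

# Merge on the shared date column, then convert and order by date
df = employers.merge(workers, on="date")
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")

# Keep rows where at least one side has a value (a comparison is
# still possible); drop only rows where both sides are NA.
df = df.dropna(subset=["employer_wfh_days", "worker_wfh_days"], how="all")
```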
“Ridership Report.” American Public Transportation Association, 21 Sept. 2023, www.apta.com/research-technical-resources/transit-statistics/ridership-report/.↩︎
Steven Ruggles, Sarah Flood, Matthew Sobek, Danika Brockman, Grace Cooper, Stephanie Richards, and Megan Schouweiler. IPUMS USA: Version 13.0 [dataset]. Minneapolis, MN: IPUMS, 2023. https://doi.org/10.18128/D010.V13.0↩︎
“Raw monthly ridership (no adjustments or estimates),” Raw Monthly Ridership (No Adjustments or Estimates) | FTA, https://www.transit.dot.gov/ntd/data-product/monthly-module-raw-data-release (accessed Nov. 14, 2023).↩︎
“Metropolitan Transportation Authority - New York, NY,” Yelp, https://www.yelp.com/biz/metropolitan-transportation-authority-new-york-6 (accessed Nov. 14, 2023).↩︎
“Metro Los Angeles - Los Angeles, CA,” Yelp, https://www.yelp.com/biz/wmata-washington (accessed Nov. 14, 2023).↩︎